01 - R framework with IMPACT - session 1

Author

Yann Say

Published

May 7, 2024

IMPACT R framework

The IMPACT R framework has been developed to meet the requirements of the IMPACT research cycle.

The ecosystem is a modular framework with two dimensions:

  • a horizontal dimension that focuses on the outcome of a given step, and
  • a vertical dimension that focuses on the content of a given step.

The framework is built around:

  • 4 steps:
    • Cleaning: any manipulation to go from the raw data to the clean data
    • Composition: any manipulation before the analysis e.g. adding indicators, adding information from loop or main, aok aggregation, etc.
    • Analysis: any manipulation regarding only the analysis
    • Outputs: any manipulation to format the outputs.
  • 4 verbs:
    • Add: functions that will add a variable (column) to the dataset.
    • Check: functions that will flag values based on a specific check. It will return them in a log. A check_* will return a list: the checked dataset, and the log. The function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).
    • Create: functions that will create, transform something, e.g. creating a cleaning log with the checks to be filled, create analysis results table, create an output. Outputs from create_* functions outputs can be in different shape, format, etc.
    • Review: functions that will review an object by comparing it to standards or another object and flags differences, e.g. reviewing the cleaning by comparing the raw dataset, the clean dataset and the cleaning log, analysis comparing it with another analysis.
  • 2 adjectives:
    • Pipe-able: In the framework, functions of the same family should be pipe-able. In the following case, 2 check_* functions are piped.
    • Independent: At any given step, the user can change tool. Each input and each output of a step should follow the same format and definition.

These elements will help to improve cooperation and collaboration between different teams while allowing modularity to adapt to each context and assessment.

Cleaning - Checking a dataset

The following section will present some introduction about the composition.

library(cleaningtools)
library(dplyr)

my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices
check_*

check_* functions will flag values based on a specific check. It will return them in a log. A *check_** will return a list: the checked dataset, and the log.

check_* functions are used only in the cleaning step.

check_outliers

my_log1 <- my_raw_dataset %>% 
  check_outliers(uuid_column = "X_uuid")

In this example, there are:

  • checked_dataset: the raw dataset (with extra variables if needed)
  • potential_outliers: a log of potential outliers
typeof(my_log1)
[1] "list"
my_log1 %>% 
  names()
[1] "checked_dataset"    "potential_outliers"

The log has at least 4 columns:

  • uuid: the unique identifier
  • issue: the issue being flagged
  • question: the name of the question
  • old_value: the value being flagged
my_log1$potential_outliers %>% 
  head()
uuid issue question old_value
b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 outlier (normal distribution) age_respondent_r 86
956b5ed0-5a62-41b7-aec3-af93fbc5b494 outlier (normal distribution) age_respondent_r 84
97ad6294-30c6-454e-a0b3-42126415b767 outlier (log distribution) age_respondent_r 18
e005e719-57c4-44a3-ac2f-5d6d1ff68831 outlier (log distribution) age_respondent_r 18
c9aaa542-118f-4e42-93de-fb0916572541 outlier (normal distribution) num_hh_member 19
48e8896b-d1be-4600-8839-2d8b994ebcfb outlier (normal distribution) num_hh_member 19
Note

Outliers are defined as +/- 3 standard deviation from the mean.

For log outliers, log(x + 1) is used.

check_duplicate

my_log2 <- my_raw_dataset %>% 
  check_duplicate(uuid_column = "X_uuid")

my_log2$duplicate_log %>% 
  head()
uuid old_value question issue

There is no duplicate. The log is empty.

Pipe-able

Pipe-able

The framework is built around 2 adjectives, pipe-able and independent. In the framework, functions of the same family should be pipe-able. In the following case, 2 check_* functions are piped.

my_log3 <- my_raw_dataset %>% 
  check_outliers(uuid_column = "X_uuid") %>% 
  check_duplicate(uuid_column = "X_uuid")
names(my_log3)
[1] "checked_dataset"    "potential_outliers" "duplicate_log"     
my_log3$potential_outliers %>% 
  head()
uuid issue question old_value
b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 outlier (normal distribution) age_respondent_r 86
956b5ed0-5a62-41b7-aec3-af93fbc5b494 outlier (normal distribution) age_respondent_r 84
97ad6294-30c6-454e-a0b3-42126415b767 outlier (log distribution) age_respondent_r 18
e005e719-57c4-44a3-ac2f-5d6d1ff68831 outlier (log distribution) age_respondent_r 18
c9aaa542-118f-4e42-93de-fb0916572541 outlier (normal distribution) num_hh_member 19
48e8896b-d1be-4600-8839-2d8b994ebcfb outlier (normal distribution) num_hh_member 19
my_log3$duplicate_log %>% 
  head()
uuid old_value question issue

More checks

This an example of more checks that exist.

more_logs <- my_raw_dataset %>% 
  check_duplicate(uuid_column = "X_uuid") %>% 
  check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
  check_outliers(uuid_column = "X_uuid") %>%
  check_value(uuid_column = "X_uuid") 

add_duration

add_*

add_* functions will add a variable (column) to the dataset. For example, to add the duration of a survey, to add the food consumption score category, etc.

add_* function takes a dataset as input and returns the dataset + the new indicator (and any intermediate steps used for the calculation).

For example, to check the duration of a survey, there is only the start and end, but not the duration column.

more_logs$checked_dataset <- more_logs$checked_dataset %>% 
  add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs$checked_dataset[1:6, c("start_date", "start_time", "end_date", "end_time", "days_diff", "duration")]
start_date start_time end_date end_time days_diff duration
2021-07-05 658.57 mins 2021-07-05 696.68 mins 0 days 38.11
2021-07-05 608.90 mins 2021-07-05 641.92 mins 0 days 33.02
2021-07-05 682.23 mins 2021-07-05 726.43 mins 0 days 44.20
2021-07-04 1342.98 mins 2021-07-04 1380.15 mins 0 days 37.17
2021-07-04 1391.62 mins 2021-07-05 18.88 mins 1 days 67.26
2021-07-05 617.38 mins 2021-07-05 756.52 mins 0 days 139.14
Warning

The duration is added to the checked_dataset in the list, not in the my_raw_dataset dataframe. The check_* functions are used in a pipe, so it needs the current dataset to be modified.

Warning

At the moment, add_duration takes very specific format. It will change in the future to become more robust and using lubridate.

check_duration can now be used with the previous checks.

more_logs <- more_logs %>% 
  check_duration(column_to_check = "duration", uuid_column = "X_uuid")

As much as possible, check_* functions take default argument or the functions will be able to guess some information, e.g. the check_outliers function guesses some numerical values. Some functions need more information.

other/text columns

check_other needs the list of columns to be checked. It currently, it cannot detect the open text question. KOBO tool can be used.

other_columns_to_check <- my_kobo_survey %>% 
  dplyr::filter(type == "text") %>% 
  dplyr::filter(name %in% names(my_raw_dataset)) %>%
  dplyr::pull(name) 

more_logs <- more_logs %>% 
  check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check) 

check_logical_with_list

In other cases, the check is specific and should be tailored to the dataset, for example, check_logical_with_list. All the logical checks can be recorded in an excel file.

logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
logical_check_list
check_id description check_to_perform columns_to_clean
check_1 primary_livelihood is employment but expenses less than 200000 primary_livelihood.employment == 1 & tot_expenses < 200000 primary_livelihood.employment,
tot_expenses
check_2 acces water and tank emptied access_water_enough == “totally_insufficient” & tank_emptied == “about_half” access_water_enough, tank_emptied

The check list has 4 columns:

  • check_id : the name of the check
  • description: the description of the check
  • check_to_perform: the check to perform. The format for the check to be performed should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.

This list can then be used with check_logical_with_list.

example_logic <- my_raw_dataset %>% 
  check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = logical_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")
example_logic$logical_all %>% 
  head()
uuid question old_value issue check_id check_binding
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb primary_livelihood.employment TRUE primary_livelihood is employment but expenses less than 200000 check_1 check_1 / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb
f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb tot_expenses 125000 primary_livelihood is employment but expenses less than 200000 check_1 check_1 / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb
e9f8b44c-c507-45a1-8d76-66d886437b8f primary_livelihood.employment TRUE primary_livelihood is employment but expenses less than 200000 check_1 check_1 / e9f8b44c-c507-45a1-8d76-66d886437b8f
e9f8b44c-c507-45a1-8d76-66d886437b8f tot_expenses 175000 primary_livelihood is employment but expenses less than 200000 check_1 check_1 / e9f8b44c-c507-45a1-8d76-66d886437b8f
994a60b8-e640-425c-9774-160651d7af04 primary_livelihood.employment TRUE primary_livelihood is employment but expenses less than 200000 check_1 check_1 / 994a60b8-e640-425c-9774-160651d7af04
994a60b8-e640-425c-9774-160651d7af04 tot_expenses 175000 primary_livelihood is employment but expenses less than 200000 check_1 check_1 / 994a60b8-e640-425c-9774-160651d7af04

The log returns :

  • uuid
  • question: for all variables in columns_to_clean
  • old value: for all variables in columns_to_clean
  • issue
  • check_id: logical check identifier
  • check_binding: the combination of the check_id and the uuid.

One check can be flagged in several rows, in the example above, for each uuid, the primary_livelihood and tot_expenses are flagged.

Format for the check_to_perform should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.

my_raw_dataset %>% 
  dplyr::mutate(xxx =  primary_livelihood.employment == 1 & tot_expenses < 200000) %>% 
  dplyr::select(X_uuid, xxx, primary_livelihood.employment, tot_expenses) %>% 
  head()
X_uuid xxx primary_livelihood.employment tot_expenses
dcf2753a-6ea2-40f5-b493-3527931ef96c FALSE FALSE 250000
8790ce5c-1c35-41a2-b3c0-538f937d5397 FALSE TRUE 750000
bb818e04-9c40-408e-919f-6b40ff1fdbb3 FALSE FALSE 250000
28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 FALSE TRUE 600000
7f2a0c6a-529b-481f-963f-a96dca2ec034 FALSE TRUE 500000
b4f92064-12ea-4970-b0f5-fd309de1dda3 FALSE FALSE 650000

The checked dataset will be return with extra columns, i.e. a logical variable with the name of the check_id.

example_logic$checked_dataset[1:6,tail(names(example_logic$checked_dataset))]
X_notes X_status X_submitted_by X_index check_1 check_2
[] submitted_via_web reach_irq 1 FALSE FALSE
[] submitted_via_web reach_irq 2 FALSE FALSE
[] submitted_via_web reach_irq 3 FALSE FALSE
[] submitted_via_web reach_irq 4 FALSE FALSE
[] submitted_via_web reach_irq 5 FALSE FALSE
[] submitted_via_web reach_irq 6 FALSE FALSE
Note

If you don’t include columns_to_clean the check_logical_with_list function will try to guess the variables. Not guarantee it will read or pick the correct names.

Exercises

Exercise 1

Try the following with a dataset:

  • Perform a check to spot personal identifiable information
library(cleaningtools)
library(dplyr)

my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices

Try the function check_pii

my_raw_dataset %>% 
  check_pii(uuid_column = "X_uuid")

Exercise 2

  • Perform a check that will look at the percentages of missing value per observation and that will spot any observation that is different.

Try the function check_percentage_missing

Did you try to add a new column with add_percentage_missing

my_raw_dataset <- my_raw_dataset %>% 
  add_percentage_missing(kobo_survey = my_kobo_survey)
  
my_raw_dataset %>% 
  check_percentage_missing(uuid_column = "X_uuid")

Or if using a log already

more_logs$checked_dataset <- more_logs$checked_dataset %>% 
  add_percentage_missing(kobo_survey = my_kobo_survey)
more_logs %>% 
  check_percentage_missing(uuid_column = "X_uuid")

Exercise 3

  • Fill the excel checklist to do the following checks:
    • household number (variable: num_hh_member) is above 8.
    • the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and the household always treat the drinking water (variable: treat_drink_water, value: always_treat).
    • the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and one of the main reason for the the household to not meet its water needs is the water pressure (variable: access_water_enough_why_not, value: water_pressure, this is a select multiple)
name label::English (en) type
num_hh_member How many members are there in your household (including you)? integer
water_source_drinking What is the main source of water used by your household for drinking? select_one water_sources
treat_drink_water Does your household treat this water in any way to make it safer to drink? select_one treat
access_water_enough_why_not What are the main reasons your household is not able to meet its water needs? select_multiple barriers_water_needs
name label::English (en) list_name
bottled Bottled or sachet water water_sources
always_treat Yes, always treat it before drinking treat
water_pressure Water pressure is not high enough/pumps required barriers_water_needs
exercise_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")

my_raw_dataset %>% 
    check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = exercise_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")
my_check_list <- data.frame(check_id = c("check_household number", "check_water_treatment", "check_3"),
                            description = c("num_hh_member is big","using bottled water and always treat","using bottled water and main reason is water pressure"),
                            check_to_perform = c("num_hh_member > 8","water_source_drinking == \"bottled\" & treat_drink_water == \"always_treat\"","water_source_drinking == \"bottled\" & access_water_enough_why_not.water_pressure == TRUE"),
                            columns_to_clean = c("num_hh_member","water_source_drinking, treat_drink_water","water_source_drinking, access_water_enough_why_not.water_pressure"))

my_raw_dataset %>% 
    check_logical_with_list(uuid_column = "X_uuid",
                          list_of_check = my_check_list,
                          check_id_column = "check_id",
                          check_to_perform_column = "check_to_perform",
                          columns_to_clean_column = "columns_to_clean",
                          description_column = "description")

Extra

  • Try to add duration with audit files.

Try create_audit_list() and add_duration_from_audit()